1. Introduction 

Most standardized educational testing has used multiple-choice (MC) response 
format. However, in recent years, there has been increasing concern that MC tests 
may be too limited in the skills they tap. As a consequence, alternative response 
formats have been developed and many are being used in several educational 
contexts (Bennet and Ward, 1993), a fact which sparked renewed interest in the 
enduring question of whether tests of the same content that employ different 
response formats measure the same traits. For example, many empirical studies on 
the equivalence of multiple-choice and discrete constructed-response (CR-D) formats 
have been reported. However, their results have not been conclusive and many were 
seriously flawed in design and analysis (Traub and MacRury, 1990). In general, these 
results suggest that MC and CR-D tests of the same content cannot be assumed to 
be equivalent and that the format effect is not uniform across subject matters. It is also 
conceivable that the format effect is not uniform across ages of examinees. With regard 
to subject matter, Traub (1993) concludes that for the quantitative domain, the two 
formats probably do not measure different traits. 

In the math computation domain, it is hypothesized that, regardless of the 
format, items will require the calculation of the answer and that answers to math 
computation items will not be recognized by most examinees when answering an MC 
test. In other words, the MC and CR-D forms of the same stem will be processed in 
the same way. Nevertheless, it has generally been assumed that correct answers to 
MC items can be guessed more readily than answers to CR-D items; MC tests are thus 
expected to be less difficult, less discriminating and less reliable than CR-D tests of the 
same content. In addition, having multiple answers, one of which is the correct one, 
may alert an examinee who makes a mistake in the computation and ends up with 
an answer that is not on the list of choices to check and/or redo the computation. 
Such guidance is not available with the CR-D format, which can further reduce the 
relative difficulty of the MC format. However, these expectations are not consistently 
supported by findings of empirical research (Traub and MacRury, 1990). 

2. Objectives 

The main purpose of this study is to test the equivalence of MC and CR-D 
response formats as applied to mathematics computation at grade levels two to six. 
This is carried out in two steps. First, the difference between total scores from the 
two response formats is tested for statistical significance and the factor structure of 
the items in both response formats is compared. Second, if, based on results 
obtained in the first analysis, we fail to reject the hypothesis that the two response 
formats measure the same traits, their relative difficulty and reliability will be 
compared. 

3. Data and Methods 

Data for this study consist of the responses of 1,028 students in grades two 
to six to the mathematics computation component of the Canadian Achievement 
Tests, Second Edition (CAT/2) (Canadian Test Centre, 1992). Stem-equivalent and 
scoring-equivalent MC and CR-D forms were used and each student was tested twice 
with a time lapse of two to four weeks between the two testings. Students at each 
grade level were divided into four groups; two groups were retested with the 
alternate format and the other two groups were retested with the same format. That 
is, response formats were used in four testing sequences, namely CR-D/CR-D, 
CR-D/MC, MC/CR-D and MC/MC. Nine schools from across Canada, five of them located 
in rural areas, participated in the study. Teachers administering the test were 
instructed not to review the first test material with their students, not to teach them 
to the test, and not to tell students that they will be writing a second test of the same 
content. 



The general linear models (GLM) procedure in the statistical computing package 
SAS is used to carry out a repeated measures analysis of variance in which 
components of variance due to carryover effects and format effects are estimated and 
tested for statistical significance. Factors are entered in the model in the 
following order: (1) testing sequence, with the four categories CR-D/CR-D, 
CR-D/MC, MC/CR-D and MC/MC; (2) student, nested within testing 
sequence; (3) order, indicating first versus second testing; (4) response format; 
and (5) carryover effect. Tests using both the hierarchical and the unique sums of 
squares are carried out. 
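
The factor coding implied by this design can be sketched as follows. This is only an illustration, not the SAS model used in the study: the function name `code_observations` is hypothetical, and coding carryover as "the format of the preceding testing" is an assumption, since the coding is not spelled out in the text.

```python
# Hypothetical coding of the four testing sequences into model factors.
SEQUENCES = ["CR-D/CR-D", "CR-D/MC", "MC/CR-D", "MC/MC"]


def code_observations(sequence):
    """Return one row per testing for a given sequence label.

    Assumption: carryover is coded as the format of the preceding testing
    ("none" for the first testing).
    """
    # Each label has exactly one slash separating the two formats.
    first, second = sequence.split("/")
    return [
        {"sequence": sequence, "order": 1, "format": first, "carryover": "none"},
        {"sequence": sequence, "order": 2, "format": second, "carryover": first},
    ]


rows = [row for seq in SEQUENCES for row in code_observations(seq)]
for row in rows:
    print(row)
```

Under this coding, each format appears at the second testing preceded once by each format, which is the property that lets format and carryover be told apart when all four sequences are present.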

Scatter plots of the score on the first testing versus the score on the second 
testing are prepared and the linearity of their relationship is examined for each testing 
sequence at each grade level. Paired t-tests of the difference between the scores on 
the first and the second testing are carried out for each testing sequence at each grade 
level. Also, test-retest reliability coefficients and correlation coefficients corrected 
for the attenuating effect of errors of measurement are calculated and compared to unity. 
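
As a minimal sketch of the correction for attenuation, the function below applies the standard formula r_xy / sqrt(r_xx * r_yy). The function name is hypothetical, and the text does not state which formula was used, but the corrected values reported in Table 2 are consistent with this one.

```python
import math


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Disattenuated correlation: r_xy / sqrt(rel_x * rel_y)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Grade 2 values as reported in Table 2:
# r = 0.70, reliability(MC) = 0.83, reliability(CR) = 0.90
print(round(correct_for_attenuation(0.70, 0.83, 0.90), 2))  # 0.81
```

A corrected coefficient close to unity would suggest that the two formats rank examinees in essentially the same way once measurement error is taken into account.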



4. Significance of the Study 

The question of equivalence of MC and CR-D response formats is far from 
being resolved. The present study is intended to shed some light on this enduring 
question. It is of importance to know whether or not different formats of the same 
stem measure different traits. It is of equal importance to know what traits are 
measured with each format. Traits measured in different tests inform students and 
teachers about the kind of knowledge and skills that are most important to learn and 
to teach and can thus have direct and indirect consequences for the educational 
system. 

In studies with a repeated-measurement design and one group of examinees, the 
CR-D format is usually administered first, followed by the MC format, thus ignoring 
carryover effects from the CR-D to the MC format. On the other hand, studies in which 
examinees are divided into two groups, each tested with both formats in reversed 
order, are bound to assume that carryover effects from MC to CR-D are 
equivalent to carryover effects from CR-D to MC. Such an assumption is not supported 
in the literature. As Heim and Watts (1967) point out, carryover effects are likely to 
be asymmetrical, with probably more carryover from MC to CR-D than from CR-D to 
MC. Only with a multiple-group design in which all combinations of format sequences 
are present, such as the one used in this study, can carryover effects, 
order effects, and format effects be separated without having to impose such an 
assumption. 

The MC and the CR-D tests used in this study are stem-equivalent and 
scoring-equivalent. Thus, it can be safely assumed that the score scales are equivalent, 
which makes it possible to compare their relative difficulty. 

Most published reports describe studies on examinees at grade eight or higher. 
Studies on younger children are difficult to find. It is also conceivable that the 
magnitude of format effects and/or carryover effects varies with the age of 
examinees. This study covers five grade levels, from grade two to grade six, which 
makes it possible to examine whether carryover effects and/or format 
effects are uniform across the age range under study. 

5. Findings 

Table 1 shows the distribution of the participating students by grade and test 
sequence. Table 2 includes correlation coefficients of total scores from the two 
testings for students who were retested with the alternate format, Cronbach's alpha 
coefficient for internal consistency and correlation coefficients corrected for the 
attenuating effects of errors of measurement. The correlations ranged between 0.70 
and 0.85 and their corrected values ranged between 0.81 and 0.92, which are quite 
high. Reliability coefficients ranged between 0.83 and 0.90 for the MC tests and 
between 0.90 and 0.95 for the CR-D tests, the latter being consistently higher than 
those of the MC tests of the same content. 
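
For readers unfamiliar with the reliability index used here, Cronbach's alpha can be sketched as below. The function name and the tiny 0/1 item matrix are hypothetical and are not study data.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """item_scores: one list of item scores per student, all of equal length."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Variance of each item across students.
    item_vars = [pvariance([s[i] for s in item_scores]) for i in range(k)]
    # Variance of the total score across students.
    total_var = pvariance([sum(s) for s in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical right/wrong scores for four students on three items.
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(cronbach_alpha(scores))  # 0.75
```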

Differences between percent correct scores achieved in the first and the 
second testings were tested using paired t-tests. Table 3 includes only those test 
sequences in which differences were statistically significant. Three of four groups in 
grade 2 showed significant improvement in test scores, two groups in each of grades 
3 and 4, and only one in grade 6. That is, effects of repeated testing 
(practice/recall/carryover effects) are greater at younger ages. Paired t-tests on 
scores from first and second testings for students in the second grade indicate a 
significant recall effect in both groups retested with the same response format, i.e. 
CR-D/CR-D and MC/MC, with mean scores in the second testing significantly higher 
than mean scores from the first testing (p-value < 0.0005 for each group). Recall 
effects are also found to be statistically significant for those students in the third and 
fourth grades who took the testing sequence CR-D/CR-D. Put another way, the test 
sequence CR-D/CR-D resulted in significant carryover/recall effects at grade levels 2, 3 
and 4, while the test sequence MC/MC resulted in significant carryover effects in grade 2 
only (which seems reminiscent of the saying 'I do and I remember'). The magnitude of 
the improvement, however, is not the same across grade levels. Although mean scores of 
the second testing were also found to be significantly higher than mean scores of the 
first testing for students in grades two, three and six in the testing sequence CR-D/MC, 
the source of this difference cannot be determined from this analysis. 
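
The paired t-test applied to each testing sequence can be sketched as follows. The function name and the five pairs of scores are hypothetical and serve only to show the computation; they are not study data.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t statistic, degrees of freedom) for the differences second - first."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [b - a for a, b in zip(first, second)]
    # Standard error of the mean difference (sample standard deviation).
    se = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / se, len(diffs) - 1


# Hypothetical percent-correct scores for five students on two testings.
first = [40.0, 55.0, 60.0, 35.0, 50.0]
second = [50.0, 60.0, 70.0, 40.0, 65.0]
t, df = paired_t(first, second)
print(round(t, 2), df)  # 4.81 4
```

A positive t indicates a higher mean on the second testing; in a same-format sequence that gain reflects repeated testing alone, whereas in a mixed-format sequence it confounds format and carryover, as noted above.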

Repeated measures analysis of variance indicates that, after adjusting for 
carryover effect and order effect, response format has a significant effect on the 
performance of students in grades two (p-value = 0.0001) and three 
(p-value = 0.0001), with the mean score on the MC format higher than the mean score 
on the CR-D format. Response format is not found to have a significant effect on the 
performance of students in grades four, five or six. The carryover factor has a 
significant effect on scores of students in grades two, three and four but not on 
scores of students in higher grades. 

Table 1. Test Sequence versus Grade 

Grade    CR/CR    CR/MC    MC/CR    MC/MC    Total
2           46       60       57       56      219
3           46       64       45       65      220
4           49       49       45       68      208
5           44       51       46       54      195
6           54       52       25       55      186
Total      239      273      218      298     1028

Computation component of the Canadian Achievement Test (CAT/2). 

Table 2. Reliability and Correlations 

         Corr of        Reliability       Corrected corr
Grade    MC and CR      MC       CR       coefficient
2        0.70           0.83     0.90     0.81
3        0.82           0.86     0.93     0.92
4        0.79           0.88     0.94     0.87
5        0.85           0.90     0.95     0.92
6        0.74           0.88     0.94     0.81

* As one would expect, reliability of CR format is consistently higher than that of MC 
format of the same content. 

* raw scores. 

Reliability = Cronbach's alpha. 

Table 3. Paired t-tests on Percent Correct Scores of 1st vs 2nd Testing, 
Significant Results Only 

Grade   Test Sequence          Sign. level, P   N     Effect
2       CR(36.04)-CR(62.71)    < 0.0005         46    carryover
        MC(69.02)-MC(77.40)    < 0.0005         56    carryover
        CR(56.03)-MC(62.24)    0.006            60    format and/or carryover
3       CR(43.03)-CR(54.41)    < 0.0005         46    carryover
        CR(60.66)-MC(69.81)    < 0.0005         64    format and/or carryover
4       CR(52.14)-CR(62.04)    < 0.0005         49    carryover
        MC(63.39)-CR(75.61)    < 0.0005         45    format and/or carryover
6       CR(70.96)-MC(77.02)    < 0.0005         52    format and/or carryover

Table 4. Results of GLM Repeated Measures ANOVA 
(using adjusted sum of squares) 

Grade   N                Effect       p-value
2       219    84.9%     format       0.0001
                         carryover    0.0001
3       220    92.1%     format       0.0001
                         carryover    0.0010
4       209    90.9%     format       0.4565
                         carryover    0.0004
5       195    95.0%     format       0.9252
                         carryover    0.7850
6       186    94.0%     format       0.3847
                         carryover    0.6423